In this problem set we will exercise some of the unsupervised learning approaches on 2018 Global Health Observatory (GHO) data. The original data is distributed by WHO as an Excel file, but a cleaned-up version ready for import into R for further analyses is available at the CSCI E-63C Canvas course web site as whs2018_AnnexB-subset-wo-NAs.txt. The cleaning and reformatting included: merging data from the three parts of Annex B, reducing column headers to one line with short tags, removal of “>”, “<” and whitespace, conversion to numeric format, removal of the attributes with more than 20% missing values, and imputation of the remaining missing values to their respective medians. You are advised to save yourself that trouble and start from the preformatted text file available at the course website as shown above. The explicit mapping of variable names to their full descriptions, as provided in the original file, is available in the Excel file whs2018_AnnexB-subset-wo-NAs-columns.xls, also on the course Canvas page. Lastly, you are advised to download a local copy of this text file to your computer and access it there (as opposed to relying on R’s ability to establish a URL connection to Canvas, which potentially requires login, etc.).
The short example of code below illustrates reading this data from
a local copy on your computer (assuming it has been copied into the current
working directory of your R session – the getwd() and
setwd() commands are helpful for finding out what it currently
is and changing it to the desired location) and displaying summaries
and a pairs plot of five (out of almost 40) arbitrarily chosen variables.
This is done for illustration purposes only – the problems in this set
expect use of all variables in this dataset.
whs2018annexBdat <- read.table("whs2018_AnnexB-subset-wo-NAs.txt",sep="\t",header=TRUE,quote="")
summary(whs2018annexBdat[,c(1,4,10,17,26)])
## TotalPopulation LifeExpectancyB ChildMortality Alcohol PhysicianDensity
## Min. : 2 Min. :52.90 Min. : 2.100 Min. : 0.000 Min. :0.000
## 1st Qu.: 1973 1st Qu.:66.25 1st Qu.: 7.725 1st Qu.: 2.500 1st Qu.:0.400
## Median : 8557 Median :73.40 Median : 16.950 Median : 6.300 Median :1.500
## Mean : 38300 Mean :71.88 Mean : 30.238 Mean : 6.165 Mean :1.741
## 3rd Qu.: 28051 3rd Qu.:76.47 3rd Qu.: 46.925 3rd Qu.: 9.375 3rd Qu.:2.800
## Max. :1411415 Max. :84.20 Max. :132.500 Max. :15.200 Max. :7.500
pairs(whs2018annexBdat[,c(1,4,10,17,26)])
In a way this dataset is somewhat similar to the
USArrests dataset extensively used in ISLR labs and
exercises – it collects various continuous statistics characterizing
human populations across different territories. It is severalfold
larger though – instead of 50 US states and 4 attributes in
USArrests, the world health statistics (WHS) data characterizes
194 WHO member states by 36 variables. Have fun!
The following problems are largely modeled after labs and exercises from ISLR Chapter 10. If anything presents a challenge, besides asking questions on Piazza (that is always a good idea!), you are also encouraged to review the corresponding lab sections in ISLR Chapter 10.
The goal here is to appreciate the impact of scaling of the input variables on the result of the principal components analysis. To that end, you will first survey means and variances of the attributes in this dataset (sub-problem 1a) and then obtain and explore results of PCA performed on data as is and after centering and scaling each attribute to zero mean and standard deviation of one (sub-problem 1b).
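The mechanism at play can be previewed on the much smaller USArrests dataset mentioned above (used here purely as a stand-in; the same logic applies to the WHS data): without scaling, the attribute with by far the largest variance dominates the first principal component.

```r
# Sketch of the scaling effect on the built-in USArrests dataset
# (a stand-in for the WHS data; the variable names are from USArrests):
pca.raw <- prcomp(USArrests)                 # untransformed
pca.scl <- prcomp(USArrests, scale. = TRUE)  # zero mean, unit variance

# Without scaling, Assault (by far the largest variance) dominates PC1;
# after scaling, all four attributes contribute comparable loadings.
round(abs(pca.raw$rotation[, 1]), 3)
round(abs(pca.scl$rotation[, 1]), 3)
```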
Compare means and variances of the untransformed attributes
in the world health statistics dataset – a plot of variance vs. mean is
probably the best choice given the number of attributes in the dataset.
The function apply allows applying a desired function
(e.g. mean, var or sd) to each
row or column of a table. Do you see all 36 attributes in the plot, or
at least most of them? (Remember that you can use
plot(inpX,inpY,log="xy") to use log scale on both
horizontal and vertical axes.) Is there a dependency between the attributes’
averages and variances? What is the range of means and variances when
calculated on untransformed data? Which are the top two attributes with
the highest mean or variance? What are the implications for a PCA
rendition of this dataset (in two dimensions) if applied to
untransformed data?
data = whs2018annexBdat
#means & variances of each attribute (sapply returns a named numeric vector)
means = sapply(data, mean)
vars = sapply(data, var)
plot(means, vars, log = "xy", main = "variance vs mean (log)")
#range of means and variances
paste('Range of means:', range(means)[1], '-', range(means)[2])
## [1] "Range of means: 0.194845360824742 - 7732494.60824742"
paste('Range of variances:', range(vars)[1], '-', range(vars)[2])
## [1] "Range of variances: 0.131374659473319 - 1287990390309082"
#sort means and variances
means.sort = means[order(unlist(means), decreasing = T)]
vars.sort = vars[order(unlist(vars), decreasing = T)]
#top 2 attributes
paste('Top 2 Attributes:', names(means.sort[1]), ',', names(means.sort[2]))
## [1] "Top 2 Attributes: NTDinterventions , TotalPopulation"
Mean and variance show a positive, roughly linear dependency (on the log-log scale). If PCA is applied to the untransformed data, the first two principal components will likely be dominated by NTDinterventions and TotalPopulation, as these are the top two attributes by both mean and variance.
Perform the steps outlined below both using
untransformed data and scaled attributes in the WHS
dataset (remember, you can use the R function prcomp to run PCA,
and to scale the data you can either pass the
output of scale applied to the WHS data matrix as input to prcomp, or call
prcomp with the parameter scale set to
TRUE). To make it explicit, the comparisons outlined below
have to be performed first on the untransformed WHS data and then again
on the scaled WHS data – you should obtain two sets of results that you
can compare and contrast.
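As a quick sanity check (sketched here on the built-in USArrests data rather than the WHS matrix), the two ways of scaling mentioned above are equivalent:

```r
# Two equivalent routes to PCA on scaled data (USArrests as a stand-in):
p1 <- prcomp(scale(USArrests))           # scale the matrix explicitly first
p2 <- prcomp(USArrests, scale. = TRUE)   # let prcomp center and scale
# The PC standard deviations agree (individual loadings may differ only
# by an overall sign flip of a component):
all.equal(p1$sdev, p2$sdev)
```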
Display the variance explained by the principal components (e.g. by calling plot on
the result of prcomp) and plot the observations against the first two principal
components using biplot. Which variables seem to predominantly drive the
results of PCA when applied to untransformed data? NTDinterventions and TotalPopulation.
Please note that you may observe biplot generating a substantial number of warnings. Usually
in R we should pay attention to these and understand whether they
indicate that something went wrong in our analyses. In this particular
case they are expected – why do you think that is? We get the warning ‘zero-length arrow is of indeterminate angle and so skipped’, which is expected here: on the untransformed data the loadings of most attributes are negligible compared to the dominant one, so their arrows have essentially zero length and cannot be drawn.
The field rotation in the output of prcomp
contains the loadings of the 1st, 2nd, etc. principal components
(PCs) – these can be interpreted as contributions of each of the attributes
in the input data to each of the PCs. Untransformed: NTDinterventions, TotalPopulation. Scaled: LifeExpectancyF, HealthyLifeExpectancy.
The top attributes for the untransformed data are the same as before; the scaled data, however, brings completely different attributes to the top.
#PCA
pca = prcomp(data) #untransformed
pcascaled = prcomp(data, scale. = TRUE) #centered and scaled
#PCA plot
par(mfrow = c(1,2))
plot(pca, main = "PCA Results (Untransformed)")
plot(pcascaled, main = "PCA Results (Scaled)")
#2 first principal components
par(mfrow = c(1,2))
biplot(pca)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## (the warning above is repeated for each additional zero-length arrow)
biplot(pcascaled)
#rotation: loadings (contributions of attributes to each PC)
rotscaled = pcascaled$rotation
rot = abs(pca$rotation[, 1]) #1st PC loadings, absolute values
sort(rot, decreasing = TRUE)[1:2] #top two contributors
## NTDinterventions TotalPopulation
## 0.999996189 0.002760906
rotscaled = abs(pcascaled$rotation[, 1]) #1st PC loadings, absolute values
sort(rotscaled, decreasing = TRUE)[1:2] #top two contributors
## LifeExpectancyF HealthyLifeExpectancy
## 0.2348898 0.2335479
#proportion of variance explained (PVE) by the first 5 components
pcavar = pca$sdev^2
pve = pcavar / sum(pcavar)
pve[1:5]
## [1] 9.999917e-01 8.327189e-06 2.084286e-09 4.057653e-11 1.144625e-11
pcascaledvar = pcascaled$sdev^2
pvescaled = pcascaledvar / sum(pcascaledvar)
pvescaled[1:5]
## [1] 0.47276318 0.06964312 0.05258246 0.04669035 0.03854480
Now that you have PCA results when applied to untransformed and scaled WHS data, please comment on how they compare and on the effect of scaling. What dataset attributes contribute the most (by absolute value) to the top two principal components in each case (untransformed and scaled data)? What are the signs of those contributions? How do you interpret that?
The PCA of the untransformed data is driven almost entirely by NTDinterventions, whereas the scaled data spreads the contributions out far more evenly, and the top two attributes become LifeExpectancyF and HealthyLifeExpectancy instead of NTDinterventions and TotalPopulation. NTDinterventions and TotalPopulation both enter with positive signs; LifeExpectancyF and HealthyLifeExpectancy both enter with negative signs. The signs themselves are arbitrary (flipping the sign of an entire PC leaves the solution unchanged); what matters is that the two life-expectancy attributes share the same sign, i.e. they vary together along the first PC.
Please note that the output of biplot with almost 200
text labels on it can be pretty busy and tough to read. You can achieve
better control when plotting PCA results if instead you plot the first
two columns of the x attribute in the output of
prcomp –
e.g. plot(prcomp(USArrests,scale=T)$x[,1:2]). Then, given
this plot, you can label a subset of countries by using the
text function in R to add labels at specified positions on
the plot. Please feel free to choose several countries of your
preference and discuss the results. Alternatively, indicate US, UK,
China, India, Mexico, Australia, Israel, Italy, Ireland and Sweden and
discuss the results.
countries = c('US', 'UK', 'China', 'India', 'Mexico', 'Australia', 'Israel', 'Italy', 'Ireland', 'Sweden')
#untransformed PCA: plot all observations, label the selected countries
xcountr = subset(pca$x, row.names(pca$x) %in% countries)
plot(pca$x[,1:2], main = "Untransformed PCA")
text(xcountr[,1:2], labels = row.names(xcountr))
#scaled PCA: same, using the scaled-PCA scores
xcountrscaled = subset(pcascaled$x, row.names(pcascaled$x) %in% countries)
plot(pcascaled$x[, 1:2], main = "Scaled PCA")
text(xcountrscaled[, 1:2], labels = row.names(xcountrscaled))
Where do the countries you have plotted fall in the graph? Considering what you found out about contributions of different attributes to the first two PCs, what do their positions tell us about their (dis-)similarities in terms of associated health statistics?
In the untransformed plot, China is very high on PC2, reflecting its very large population, and India is very high on PC1, reflecting its very high NTDinterventions count. The rest of the data points sit much lower on both PC1 and PC2. The scaled data shows a shape in which both low and high PC1 values correspond to low PC2, while mid-range PC1 corresponds to high PC2. Sweden and Australia show low/negative PC1 and PC2, whereas India shows mid-range PC1 and high PC2; since the life-expectancy attributes load negatively on PC1, India’s higher PC1 relative to Sweden and Australia is consistent with lower life expectancies.
The goal of this problem is to practice use of K-means clustering and
in the process appreciate the variability of the results due to
different random starting assignments of observations to clusters and
the effect of parameter nstart in alleviating it.
Using function kmeans perform K-means clustering on
explicitly scaled (e.g. kmeans(scale(x),2)) WHS
data for 2, 3 and 4 clusters. Use cluster attribute in the
output of kmeans to indicate cluster membership by color
and/or shape of the corresponding symbols in the plot of the first two
principal components generated independently on the same (scaled WHS)
data. E.g.
plot(prcomp(xyz)$x[,1:2],col=kmeans(xyz,4)$cluster) where
xyz is input data.
#kmeans
k2 = kmeans(scale(data), 2)
k3 = kmeans(scale(data), 3)
k4 = kmeans(scale(data), 4)
plot(pcascaled$x[,1:2],col=k2$cluster, main = 'PCA 2 Clusters')
text(xcountrscaled[, 1:2],
labels = row.names(xcountrscaled))
plot(pcascaled$x[,1:2],col=k3$cluster, main = 'PCA 3 Clusters')
text(xcountrscaled[, 1:2],
labels = row.names(xcountrscaled))
plot(pcascaled$x[,1:2],col=k4$cluster, main = 'PCA 4 Clusters')
text(xcountrscaled[, 1:2],
labels = row.names(xcountrscaled))
Describe the results. Which countries are clustered together for each of these choices of \(K\)?
K=2: the two clusters split the plot roughly in half left/right, with all of the labeled countries in the left cluster except for India. K=3: the first two clusters are skewed slightly to the left, and the third cluster sits closer to the halfway point. Cluster 1 has Australia, Sweden, Italy, Israel and Ireland; cluster 2 shows only Mexico and China; cluster 3 shows only India. K=4: the clusters occupy somewhat even quadrants, with the same countries grouped as for K=3; none of the labeled countries falls into the fourth cluster.
nstart parameter (15 points)
By default, k-means clustering uses a random set of centers as initial
guesses of the cluster centers. Here we will explore the variability of k-means
cluster membership across several such initial random guesses. To make
such choices of random centers reproducible, we will use the function
set.seed to reset the random number generator (RNG) used in R
to a known/controlled initial state.
Using the approach defined above, repeat k-means clustering of
explicitly scaled WHS data with four (centers=4)
clusters three times resetting RNG each time with set.seed
using seeds of 1, 2 and 3 respectively (and default value of
nstart=1). Indicate cluster membership in each of these
three trials on the plot of the first two principal components using
color and/or shape as described above. Two fields in the output of
kmeans – tot.withinss and
betweenss – characterize within and between clusters
sum-of-squares. Tighter clustering results are those which have smaller
ratio of within to between sum-of-squares. What are the resulting ratios
of within to between sum-of-squares for each of these three k-means
clustering results (with random seeds of 1, 2 and 3)?
Seed 1 and 3 result in ratios of 1.04. Seed 2 results in a larger ratio of 1.12.
Please bear in mind that the actual cluster identity is assigned
randomly and does not matter – i.e. if cluster 1 from the first run of
kmeans (with random seed of 1) and cluster 4 from the run
with the random seed of 2 contain the same observations (country/states
in case of WHS dataset), they are the same clusters.
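This invariance to label permutation can be checked directly with table. A minimal sketch on synthetic data (hypothetical, not the WHS set) with two well-separated groups:

```r
# Cluster labels are arbitrary: compare two k-means runs with table().
# Hypothetical synthetic data: two well-separated groups of 25 points each.
set.seed(1)
x <- rbind(matrix(rnorm(50), ncol = 2),
           matrix(rnorm(50, mean = 10), ncol = 2))
set.seed(10); run1 <- kmeans(x, 2)$cluster
set.seed(20); run2 <- kmeans(x, 2)$cluster
# If the two partitions are identical up to relabeling, every row and
# column of the cross-tabulation contains exactly one non-zero cell:
table(run1, run2)
```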
Repeat the same procedure (k-means with four clusters for RNG seeds
of 1, 2 and 3) now using nstart=100 as a parameter in the
call to kmeans. Represent results graphically as before.
How does cluster membership compare between those three runs now? What
is the ratio of within to between sum-of-squares in each of these three
cases? What is the impact of using higher than 1 (default) value of
nstart? What is the ISLR recommendation on this offered in
Ch. 10.5.1?
With nstart = 100, the ratio is 1.04 in all three cases and cluster membership is consistent across runs: the three plots look essentially identical, except that the blue/green colors are switched in one of them (cluster identities are assigned arbitrarily). With nstart = 1, there is noticeably more variability in the resulting clusters. ISLR Ch. 10.5.1 recommends always running K-means with a large value of nstart, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.
One way to achieve everything this sub-problem calls for is to loop
over nstart values of 1 and 100; for each value of
nstart, loop over RNG seeds of 1, 2 and 3; and for each value
of the RNG seed, reset the RNG, call kmeans, and plot the results for
that combination of nstart and RNG seed.
#plot K4 with seed = 1, 2, 3 & nstart = 1, 100
for(nstart in c(1, 100)){ #nstart loop
  for(seed in 1:3){ #seed loop
    set.seed(seed)
    km = kmeans(scale(data), 4, nstart = nstart)
    plot(pcascaled$x[, 1:2], col = km$cluster, main = paste("Seed:", seed, "nstart:", nstart))
    text(xcountrscaled[, 1:2], labels = row.names(xcountrscaled))
    cat(paste('Seed:', seed, #print info
              'nstart:', nstart,
              'tot.withinss:', round(km$tot.withinss, 2),
              'betweenss:', round(km$betweenss, 2),
              'ratio:', round((km$tot.withinss / km$betweenss), 2), "\n"))
  }
}
## Seed: 1 nstart: 1 tot.withinss: 3549.56 betweenss: 3398.44 ratio: 1.04
## Seed: 2 nstart: 1 tot.withinss: 3669.55 betweenss: 3278.45 ratio: 1.12
## Seed: 3 nstart: 1 tot.withinss: 3543.91 betweenss: 3404.09 ratio: 1.04
## Seed: 1 nstart: 100 tot.withinss: 3543.91 betweenss: 3404.09 ratio: 1.04
## Seed: 2 nstart: 100 tot.withinss: 3543.91 betweenss: 3404.09 ratio: 1.04
## Seed: 3 nstart: 100 tot.withinss: 3543.91 betweenss: 3404.09 ratio: 1.04
Cluster the countries in the (scaled) world health statistics data using
the default (Euclidean) distance and “complete”, “average”, “single” and
“ward” linkages in the call to hclust. Plot each clustering
hierarchy and describe the differences. For comparison, plot the results of
clustering the untransformed WHS data using default parameters
(Euclidean distance, “complete” linkage) – discuss the impact of the
scaling on the outcome of hierarchical clustering.
for (m in c('complete', 'average', 'single', 'ward.D2')){
  hier = hclust(dist(scale(data)), method = m)
  plot(hier, main = m, labels = F, xlab = '') #remove names for readability
}
#untransformed
hieruntx = hclust(dist(data), method = 'complete')
plot(hieruntx, main = 'untransformed (complete)', labels = F, xlab = '')
On the scaled data, complete linkage shows a wide range of merge heights with branches spread out. Average linkage is skewed downwards, with the taller branches concentrated on the left. Single linkage is similar to average but much narrower. Ward linkage shows a more orderly split into somewhat even clusters. The untransformed plot shows essentially all observations merging at about the same height relative to the top: the distances are dominated by the one or two attributes with the largest variance, so the remaining structure is flattened out.
Using function cutree on the output of
hclust determine assignment of the countries in WHS dataset
into top four clusters when using Euclidean distance and Ward linkage.
(Feel free to choose which one of the two varieties of Ward linkage
available in hclust you want to use here!). Use function
table to compare membership of these clusters to those
produced by k-means clustering with four clusters in the Problem 2(b)
when using nstart=100 (and any of the RNG seeds) above.
Discuss the results.
hierward = hclust(dist(scale(data)), method = 'ward.D2')
cut = cutree(hierward, k = 4)
k4 = kmeans(scale(data), 4, nstart = 100)
kmclust = k4$cluster #from problem 2
comp = cbind(cut, kmclust)
table(cut, kmclust)
## kmclust
## cut 1 2 3 4
## 1 41 0 0 0
## 2 0 0 0 71
## 3 0 44 0 10
## 4 12 0 2 14
41 observations fall in cluster 1 for both cutree and k-means (cluster labels are arbitrary, so what matters is that the same countries end up together). All 71 observations in cutree cluster 2 appear in k-means cluster 4. Cutree cluster 3 mostly matches k-means cluster 2 (44 observations), with 10 going to k-means cluster 4. Cutree cluster 4 is spread across three k-means clusters (12, 2 and 14 observations), and k-means cluster 3 contains only 2 observations, both from cutree cluster 4. Overall the two methods agree on the larger clusters but differ on the smaller, more heterogeneous one.